Mapd’O Network metrics exploration

Progress meeting 22.04.2024

Leo Helling

Points

  • Dataset overview

  • Variable cleaning

  • Principal Component Analysis

  • K-means Clustering

  • Hidden-Markov-Modeling

  • Next steps

Dataset overview

The dataset contains only the normalized variables for land use and lateral continuity (areas normalized by the valley bottom area).

Variable cleaning

Based on high values of correlation and similarities in the PCA, the following variables are removed:

  • floodplain_slope as it is represented well by talweg_slope
  • gravel_bars_pc as it is represented well in active_channel_pc
  • water_channel_width as it is represented well in active_channel_width
  • valley_bottom_width as it is represented well by sum_area
  • semi_natural_pc as it is incorrectly calculated and only represents grassland_pc
  • reversible_pc as it is incorrectly calculated and only represents grassland_pc and crops_pc
  • infrastructures_pc, dense_urban_pc, and diffuse_urban_pc as they are represented well by built_environment_pc
  • natural_corridor_width as it is represented well by connected_corridor_width
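
The correlation part of this screening can be sketched as follows. This is a minimal Python illustration, not the analysis code used for the project: the 0.9 threshold, the function names, and the toy values are assumptions made for the example, and only the variable names are taken from the list above.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def redundant_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold.

    The threshold is an assumed value, not one stated in the notes.
    """
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs

# Toy values for illustration only (not project data).
toy = {
    "talweg_slope": [0.10, 0.20, 0.30, 0.40, 0.50],
    "floodplain_slope": [0.12, 0.19, 0.33, 0.41, 0.52],
    "forest_pc": [40, 10, 35, 5, 20],
}
flagged = redundant_pairs(toy)
print(flagged)
```

For each flagged pair, one variable can be kept as the representative of the other, as done in the list above.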

PCA

According to the results, the first four principal components together represent 64.5 % of the variability of the dataset. In the following, each of these PCs is interpreted based on the variables most strongly associated with it.
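
How a statement like "the first k components represent x % of the variability" is obtained can be sketched as follows. This is a minimal Python illustration under assumptions: the eigenvalues are made-up toy numbers (not the eigenvalues of this dataset) and the function name is hypothetical.

```python
def components_needed(eigenvalues, target):
    """Smallest number of leading PCs whose cumulative share of variance
    reaches `target` (a fraction between 0 and 1), plus that share."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value / total
        if cumulative >= target:
            return k, cumulative
    return len(eigenvalues), cumulative

# Toy eigenvalues for illustration only (not the project's PCA output).
toy_eigenvalues = [4.0, 2.5, 1.5, 1.0, 0.6, 0.4]
k, share = components_needed(toy_eigenvalues, target=0.75)
print(k, round(share, 3))
```

Each eigenvalue divided by the sum of all eigenvalues gives the share of variance of one component; the cumulative sum of these shares is what is reported above for the first four PCs.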

Characteristics of principal components
PC Description
PC1

Positive values indicate large rivers in wide valleys with low slopes and low elevations, with comparably small riparian corridor and diverse anthropogenic activity in the adjacent areas.

Negative values indicate smaller rivers in narrow valleys with higher slopes and elevations, with a greater relative area for the riparian corridor and less activity in the adjacent areas.

  • positive variable importance: sum_area, strahler, connected_corridor_width, active_channel_width, crops_pc, disconnected_pc

  • negative variable importance: talweg_slope, talweg_elevation_min, riparian_corridor_pc

PC2

Positive values indicate rather narrow valleys in which most of the space is taken by the water channel, with little space for the connected corridor and crops.

Negative values indicate wide valleys with smaller channel width to valley width ratios and larger shares of connected corridor and crops.

  • positive VI: water_channel_pc, idx_confinement, active_channel_width

  • negative VI: sum_area, crops_pc, disconnected_pc, connected_corridor_width

PC3

Positive values indicate comparably large and forested riparian corridors at lower elevations with little grassland and natural open area.

Negative values thus indicate comparably small and unforested riparian corridors in higher elevations and with more natural open areas and grasslands.

  • positive VI: riparian_corridor_pc, forest_pc

  • negative VI: talweg_elevation_min, natural_open_pc, grassland_pc

PC4

Positive values indicate rather smaller, confined streams with a strong presence of anthropogenic infrastructure.

Negative values thus indicate comparably larger rivers with more space for the active channel and no presence of built/anthropogenic infrastructure in the adjacent zones.

  • positive VI: riparian_corridor_pc, forest_pc

  • negative VI: talweg_elevation_min, natural_open_pc, grassland_pc, idx_confinement, crops_pc, active_channel_width, connected_corridor_width

K-means Clustering

K-means is a clustering method that partitions the data around cluster centroids (centers of gravity), chosen so that the squared distance between each data point and the centroid of its cluster is minimized. The number of clusters must be set before the method can be applied. For this purpose, 24 different indices were evaluated using the NbClust package. Among all indices:

  • 4 proposed 2 as the best number of clusters
  • 5 proposed 3 as the best number of clusters
  • 2 proposed 4 as the best number of clusters
  • 10 proposed 5 as the best number of clusters
  • 1 proposed 8 as the best number of clusters
  • 1 proposed 9 as the best number of clusters
  • 1 proposed 10 as the best number of clusters

According to the majority rule, the best number of clusters is 5.
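
The majority rule over the index votes listed above can be reproduced with a short tally. This is a minimal Python sketch that uses only the vote counts reported here; the variable names are chosen for illustration.

```python
from collections import Counter

# Number of indices proposing each cluster count, as listed above.
votes = Counter({2: 4, 3: 5, 4: 2, 5: 10, 8: 1, 9: 1, 10: 1})

total_indices = sum(votes.values())          # all indices that cast a vote
best_k, n_votes = votes.most_common(1)[0]    # cluster count with most votes

print(f"{n_votes} of {total_indices} indices propose k = {best_k}")
```

The tally also confirms that the votes add up to the 24 indices evaluated.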

Based on the data-distributions, the main characteristics of the clusters are summarized in the following table:

Characteristics of clusters
Cluster Derived characteristics
1

Rivers confined by anthropogenized floodplain

Rather confined, lower-elevation rivers with an altered riparian zone including diverse uses such as urban and agricultural infrastructure.

  • above average values: PC4, PC1
2

Larger rivers with agricultural landscape

Larger rivers in wide valleys with low slopes and low elevations, with semi-intensive riparian corridor use due to agricultural activity.

  • above average values: PC1

  • below average: PC2, PC4

3

Small upstream rivers

Smaller and unforested riparian corridors in higher elevations and with more natural open areas and grasslands and less activity in the adjacent areas.

  • below average values: PC1, PC3
4

Forested medium-sized rivers

Large and forested riparian corridors at lower elevations with little grassland and natural open area.

  • above average values: PC3
5

Diverse medium-sized and large rivers

Medium-sized and larger streams at lower elevations with different land-use patterns and active channel sizes.

  • above average values: PC1, PC2, PC3

Hidden-Markov-Model

A 3-state HMM is applied to the cluster series of the Isère River. Modeling is done with the HMM package, using the Baum-Welch algorithm to fit the model and the Viterbi algorithm to compute the most probable path of states.
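
The decoding step can be illustrated with a small Viterbi implementation. This is a sketch in plain Python, not the R code used for the analysis, and it covers only the decoding, not the Baum-Welch fit. All probabilities, the state names S1 to S3, and the short observation series are invented for the example.

```python
from math import log

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for `obs`, computed with log-probabilities."""
    best = [{s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor of state s at this step.
            prev = max(states, key=lambda p: best[-1][p] + log(trans_p[p][s]))
            scores[s] = best[-1][prev] + log(trans_p[prev][s]) + log(emit_p[s][o])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Hypothetical 3-state model: "sticky" states, each favouring one cluster label.
states = ["S1", "S2", "S3"]
start_p = {s: 1 / 3 for s in states}
trans_p = {a: {b: 0.8 if a == b else 0.1 for b in states} for a in states}
favoured = {"S1": 1, "S2": 2, "S3": 3}
emit_p = {s: {c: 0.9 if c == favoured[s] else 0.05 for c in (1, 2, 3)} for s in states}

cluster_series = [1, 1, 2, 1, 1, 3, 3, 3]  # toy series, not the Isère data
path = viterbi(cluster_series, states, start_p, trans_p, emit_p)
print(path)
```

In this toy setting the isolated cluster label 2 is absorbed into the surrounding state, which shows the smoothing effect of decoding a cluster series with an HMM.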

Dependent mixture models (depmix) allow for multivariate modelling of hidden Markov chains. First attempts are shown below; the comparison plot contains the following panels:

  1. Cluster series on Isère

  2. HMM as previously presented

  3. depmix-model based on clusters

  4. depmix-model based on direct values of the four first principal components

converged at iteration 55 with logLik: -7424.351 

Next steps

  • remove “holes” in the series representation so that it fits the longitudinal extent of the stream segments well

  • use the usethis-package to hide password of database access

  • advance with the dependent mixture / HMM model

  • start learning R-Shiny development (e.g. with Lise’s course and ThinkR material)
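
The idea behind hiding the database password, namely keeping it out of the source code and reading it from the environment at run time, can be sketched as follows. This is a Python illustration rather than the planned R setup; the variable name MAPDO_DB_PASSWORD and the function name are hypothetical.

```python
import os

def database_password(var_name="MAPDO_DB_PASSWORD"):
    """Read the database password from an environment variable.

    The variable name is a hypothetical placeholder; the password itself
    is never written into the script or committed to version control.
    """
    password = os.environ.get(var_name)
    if not password:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return password
```

The script then calls database_password() when opening the connection, and each user sets the variable locally outside the repository.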